DocParser: Hierarchical Document Structure Parsing from Renderings
Authors
Abstract
Translating renderings (e.g., PDFs, scans) into hierarchical document structures is extensively demanded in the daily routines of many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed “DocParser”: an end-to-end system for parsing the complete document structure, including all text elements, nested figures, tables, and table cell structures. Our second contribution is to provide a dataset for evaluating hierarchical document structure parsing. Our third contribution is to propose a scalable learning framework for settings where domain-specific data are scarce, which we address by a novel approach to weak supervision that significantly improves the document structure parsing performance. Our experiments confirm the effectiveness of our proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1% and the F1 score for classifying hierarchical relations by 35.8%.
Similar resources
Multi-Document Discourse Parsing Using Traditional and Hierarchical Machine Learning
Multi-document handling is essential today, when many documents on the same topic are produced, especially considering the Web. Both readers and computer applications can benefit from a discourse analysis of this multidocument content, since it demonstrates clearly the relations among portions of these documents. This work aims to identify such relations automatically using machine learning tec...
Impact of Document Structure on Hierarchical Summarization
Hierarchical summarization technique summarizes a large document based on the hierarchical structure and salient features of the document. Previous study has shown that hierarchical summarization is a promising technique which can effectively extract the most important information from the source document. Hierarchical summarization has been extended to summarization of multiple documents. Thre...
Hierarchical Word Structure-based Parsing: A Feasibility Study on UD-style Dependency Parsing in Japanese
In applying word-based dependency parsing such as Universal Dependencies (UD) to Japanese, the uncertainty of word segmentation emerges for defining a word unit of the dependencies. We introduce the following hierarchical word structures to dependency parsing in Japanese: morphological units (a short unit word, SUW) and syntactic units (a long unit word, LUW). This paper describes the results o...
Hierarchical Search for Parsing
Both coarse-to-fine and A∗ parsing use simple grammars to guide search in complex ones. We compare the two approaches in a common, agenda-based framework, demonstrating the tradeoffs and relative strengths of each method. Overall, coarse-to-fine is much faster for moderate levels of search errors, but below a certain threshold A∗ is superior. In addition, we present the first experiments on hie...
Detection of Malicious PDF Files Based on Hierarchical Document Structure
Malicious PDF files remain a real threat, in practice, to masses of computer users, even after several high-profile security incidents. In spite of a series of security patches issued by Adobe and other vendors, many users still have vulnerable client software installed on their computers. The expressiveness of the PDF format, furthermore, enables attackers to evade detection with little effo...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v35i5.16558